January 11, 2016

Reproducibility: who cares?

Science retracts gay marriage paper without agreement of lead author LaCour

  • In May 2015 Science retracted a study of how canvassers can sway people's opinions about gay marriage, published just five months earlier.

  • Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results. + Survey incentives misrepresented. + Sponsorship statement false.

  • Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.

  • Methods we'll discuss today can't prevent this, but they can make it easier to discover issues.

Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent

Seizure study retracted after authors realize data got "terribly mixed"

From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:

"The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness."





Source: http://retractionwatch.com/2013/02/01/seizure-study-retracted-after-authors-realize-data-got-terribly-mixed/

Bad spreadsheet merge kills depression paper, quick fix resurrects it

  • The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.

  • Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].

  • Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].

Source: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/
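
A merge error of this kind can often be caught with a few explicit checks on the identification codes. The sketch below uses two small made-up data frames (`lab` and `survey`, each with a hypothetical `id` column); none of these names or values come from the paper in question.

```r
# Toy data; all names and values are invented for illustration.
lab <- data.frame(id = c(1, 2, 3, 4), il6 = c(2.1, 3.4, 1.8, 2.9))
survey <- data.frame(id = c(1, 2, 3, 5), depressed = c(TRUE, FALSE, TRUE, FALSE))

# IDs must be unique in each table, otherwise merge() silently duplicates rows.
stopifnot(!anyDuplicated(lab$id), !anyDuplicated(survey$id))

# Record IDs that appear in only one of the two tables before merging.
only_lab <- setdiff(lab$id, survey$id)
only_survey <- setdiff(survey$id, lab$id)

# Merge on the ID column by name, never by row position.
merged <- merge(lab, survey, by = "id")

# The merged table should contain exactly the IDs common to both inputs.
stopifnot(nrow(merged) == length(intersect(lab$id, survey$id)))
```

Checks like these do not prove a merge is right, but they turn a silent mismatch into a visible error or a list of IDs to investigate.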

Divorce study felled by a coding error gets a second chance

Divorce study retraction: Editor's note

  • "The research environment is fast-paced given the ethos to “publish or perish”."

  • "[…] research is becoming increasingly complex, with greater calls for transdisciplinary collaborations, “big data,” and more sophisticated research questions and methods […] data sets often have multiple files that require merging, change the wording of questions over time, provide incomplete codebooks, and have unclear and sometimes duplicative variables."

  • "Given these issues, I would not be surprised if coding errors were fairly common, and that the ones discovered constitute only the “tip of the iceberg.”"



Source: http://retractionwatch.com/2015/09/10/divorce-study-felled-by-a-coding-error-gets-a-second-chance/#more-32151

Reproducibility: why should you care?

Think back to every time…

  • The results in Table 1 don't seem to correspond to those in Figure 2.
  • In what order do I run these scripts?
  • Where did we get this data file?
  • Why did I omit those samples?
  • How did I make that figure?
  • "Your script is now giving an error."
  • "The attached is similar to the code we used."



Source: Karl Broman





Your closest collaborator is you six months ago,
but you don’t reply to emails.

- Mark Holder




Reproducibility: how?

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other data?
  • Can you extend the code to do other things?

Ambitious goal + many other concerns

We need an environment where

  • data, analysis, and results are tightly connected, or better yet, inseparable

  • reproducibility is built in
    • the original data remains untouched
    • all data manipulations and analyses are inherently documented
  • documentation is human readable and syntax is minimal
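
One way to keep the original data untouched is to treat the raw-data folder as read-only and write every derived file somewhere else. A minimal sketch, assuming a hypothetical project layout with `raw-data/` and `derived-data/` folders (created here inside a temporary directory so the example is self-contained; the file names and values are invented):

```r
# Hypothetical project layout inside a temporary directory.
project <- file.path(tempdir(), "project")
raw_dir <- file.path(project, "raw-data")
derived_dir <- file.path(project, "derived-data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(derived_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw file as it was originally received (invented values).
raw_file <- file.path(raw_dir, "survey.csv")
write.csv(data.frame(id = 1:4, age = c(25, 31, NA, 19)), raw_file, row.names = FALSE)
raw_checksum <- tools::md5sum(raw_file)

# Every manipulation happens in code and is written to a different file.
survey <- read.csv(raw_file)
survey_clean <- survey[!is.na(survey$age), ]  # drop rows with missing age
clean_file <- file.path(derived_dir, "survey-clean.csv")
write.csv(survey_clean, clean_file, row.names = FALSE)

# The raw file is unchanged after the analysis has run.
stopifnot(identical(unname(tools::md5sum(raw_file)), unname(raw_checksum)))
```

Because the cleaning step lives in code, the script itself documents which rows were dropped and why.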

Toolkit

Outline

  1. Scriptability → R

  2. Literate programming → R Markdown

  3. Version control → Git / GitHub

  4. Other considerations

1. Scriptability

Point-and-click vs. scripting

  • Learning curve: Point-and-click software (supposedly) has a shallower learning curve than scripting languages

  • Documentation: At a minimum, your code documents your analysis
    • And you can do better with comments and README files
  • Automation: Need to rerun your analysis with new/updated data? Just change the input file.

  • Collaboration: Sharing your analysis is as easy as sharing your scripts
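
The automation point can be sketched as a function whose only input is a file path, so rerunning the analysis on updated data means changing one argument. The function name `summarise_ages()`, the `age` column, and the two input files are hypothetical.

```r
# Hypothetical analysis step: everything it needs comes in through `path`.
summarise_ages <- function(path) {
  dat <- read.csv(path)
  c(n = nrow(dat), mean_age = mean(dat$age, na.rm = TRUE))
}

# Two invented input files standing in for the original and the updated data.
old_file <- tempfile(fileext = ".csv")
new_file <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(20, 30)), old_file, row.names = FALSE)
write.csv(data.frame(age = c(20, 30, 40)), new_file, row.names = FALSE)

# Same code, different input file.
old_result <- summarise_ages(old_file)
new_result <- summarise_ages(new_file)
```

Sharing this analysis with a collaborator then amounts to sharing the script and telling them which file to point it at.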

Why R?

  • Programming language for data analysis
  • Free!
  • Open source
  • Widely used and supported across all disciplines
  • Can be used on Windows, Mac OS X, or Linux

Why not language X?

  • There are a number of other great programming tools out there that can also be used to improve the reproducibility of your analysis

  • The key is to use some type of language that will allow you to automate and document your analysis

  • Once you master one language you'll probably find it easier to learn another

Once in R

You could just type into the command prompt…

  • … but that doesn't help much with documentation

  • … but that doesn't help much with automation

2. Literate programming

Donald Knuth, "Literate Programming" (1983)

"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

"The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other."

  • These ideas have been around for years!
  • and tools for putting them to practice have also been around
  • but they have never been as accessible as the current tools

A better solution than just R

With RStudio you can combine your programming and your documentation

  • RStudio gives you a single environment to combine your documentation and your analysis
  • It runs on top of R
  • Gives you a bunch of really cool features that we'll explore throughout the workshop

Anatomy of RStudio

  • Left: Console
    • Text on top at launch: version of R that you’re running
    • Below that is the prompt
  • Upper right: Workspace and command history
  • Lower right: Plots, access to files, help, packages, data viewer

What is Markdown?

  • Markdown is a lightweight markup language for creating HTML (or XHTML) documents.

  • Markup languages are designed to produce documents from human readable text (and annotations).

  • Some of you may be familiar with LaTeX. This is another (less human-friendly) markup language, used for creating PDF documents.

  • Why I love Markdown:
    • Simple syntax means easy to learn and use.
    • Focus on content, rather than coding and debugging errors.
    • Allows for easy web authoring.
    • Once you have the basics down, you can get fancy and add HTML, JavaScript, and CSS.

What is R Markdown?

Well, it's R + Markdown

  • Ease of Markdown syntax

  • Rendering of R code to produce output and plots

R Markdown: syntax

[Figure: R Markdown text syntax]

R Markdown: code

[Figure: R Markdown code chunks]
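
As a rough sketch, a small R Markdown file has three parts: a YAML header, Markdown text, and one or more R code chunks. The R code below writes such a file to a temporary location and prints it. The file name, title, and chunk contents are made up for illustration, and the rendering step is left as a comment because it needs the `rmarkdown` package and pandoc to be installed.

```r
fence <- strrep("`", 3)  # three backticks open and close a code chunk

rmd_lines <- c(
  "---",
  "title: \"A made-up example\"",
  "output: html_document",
  "---",
  "",
  "## A heading",
  "",
  "Plain text with *italics*, **bold**, and a list:",
  "",
  "- first item",
  "- second item",
  "",
  paste0(fence, "{r}"),      # start of an R chunk
  "summary(cars$speed)",     # R code that runs when the document is rendered
  fence                      # end of the chunk
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Show the file exactly as it would appear in an editor.
cat(readLines(rmd_file), sep = "\n")

# To turn it into HTML (requires the rmarkdown package and pandoc):
# rmarkdown::render(rmd_file)
```

When the file is rendered, the chunk's code is run and its output is placed in the document right after the code.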

Example: Big Five Personality Test

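library(dplyr)   # provides %>%, tbl_df(), and filter()
library(ggplot2) # used for the plots

big5 <- read.delim("raw-data/big5.txt") %>%
  tbl_df() # for formatting; newer dplyr versions use as_tibble() instead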

View data

big5
## Source: local data frame [19,719 x 57]
## 
##     race   age engnat gender  hand source country    E1    E2    E3    E4
##    (int) (int)  (int)  (int) (int)  (int)  (fctr) (int) (int) (int) (int)
## 1      3    53      1      1     1      1      US     4     2     5     2
## 2     13    46      1      2     1      1      US     2     2     3     3
## 3      1    14      2      2     1      1      PK     5     1     1     4
## 4      3    19      2      2     1      1      RO     2     5     2     4
## 5     11    25      2      2     1      2      US     3     1     3     3
## 6     13    31      1      2     1      2      US     1     5     2     4
## 7      5    20      1      2     1      5      US     5     1     5     1
## 8      4    23      2      1     1      2      IN     4     3     5     3
## 9      5    39      1      2     3      4      US     3     1     5     1
## 10     3    18      1      2     1      5      US     1     4     2     5
## ..   ...   ...    ...    ...   ...    ...     ...   ...   ...   ...   ...
## Variables not shown: E5 (int), E6 (int), E7 (int), E8 (int), E9 (int), E10
##   (int), N1 (int), N2 (int), N3 (int), N4 (int), N5 (int), N6 (int), N7
##   (int), N8 (int), N9 (int), N10 (int), A1 (int), A2 (int), A3 (int), A4
##   (int), A5 (int), A6 (int), A7 (int), A8 (int), A9 (int), A10 (int), C1
##   (int), C2 (int), C3 (int), C4 (int), C5 (int), C6 (int), C7 (int), C8
##   (int), C9 (int), C10 (int), O1 (int), O2 (int), O3 (int), O4 (int), O5
##   (int), O6 (int), O7 (int), O8 (int), O9 (int), O10 (int)

Clean data

You can include script files in your R Markdown document:

source("code/01-data-cleanup.R")
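
To make the mechanism concrete, here is a small self-contained sketch of `source()`: it writes a throwaway script to a temporary file and then runs it. The script contents and the object name `cleaned` are invented; they are not what `code/01-data-cleanup.R` actually contains.

```r
# Write a throwaway script to disk (a stand-in for a real cleanup script).
script <- tempfile(fileext = ".R")
writeLines(c(
  "# drop missing values and keep plausible ages only",
  "ages <- c(13, 25, NA, 99, 208)",
  "cleaned <- ages[!is.na(ages) & ages <= 100]"
), script)

# source() runs the file from top to bottom. By default it evaluates in the
# global environment; a separate environment is used here so the example
# does not touch your workspace.
env <- new.env()
source(script, local = env)

env$cleaned
```

Keeping the cleanup in its own script means the same steps can be rerun from any document or session that needs the cleaned data.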

View distribution of age

ggplot(big5, aes(x = age)) +
  geom_histogram()

summary(big5$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   18.00   22.00   26.26   31.00   99.00

Regress extraversion vs. neuroticism and gender

Extraversion: Seeking fulfillment from sources outside the self or in community. High scorers are social, low scorers prefer to work alone. Neuroticism: Being emotional.

m_ext_neur <- lm(extraversion ~ neuroticism * gender, data = big5)
summary(m_ext_neur)
## 
## Call:
## lm(formula = extraversion ~ neuroticism * gender, data = big5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.3125  -6.3391   0.0132   6.6079  26.0924 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             15.202758   0.190240  79.913  < 2e-16
## neuroticism              0.297346   0.009615  30.925  < 2e-16
## genderMale              -1.893017   0.327308  -5.784 7.42e-09
## genderOther             -5.721794   2.177580  -2.628  0.00861
## neuroticism:genderMale   0.001576   0.015226   0.104  0.91755
## neuroticism:genderOther -0.008332   0.125205  -0.067  0.94694
## 
## Residual standard error: 8.854 on 19605 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.08003,    Adjusted R-squared:  0.0798 
## F-statistic: 341.1 on 5 and 19605 DF,  p-value: < 2.2e-16

Plot extraversion vs. neuroticism and gender

ggplot(data = big5, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) + # jittered points only; adding geom_point() too would draw every point twice
  geom_smooth(method = "lm")

Suppose you want only teens

big5_teen <- filter(big5, age <= 19)
m_ext_age_teen <- lm(extraversion ~ age * gender, data = big5_teen)
summary(m_ext_age_teen)
## 
## Call:
## lm(formula = extraversion ~ age * gender, data = big5_teen)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8426  -6.9399   0.0037   7.0601  22.6662 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     14.12536    1.43788   9.824  < 2e-16
## age              0.30091    0.08502   3.539 0.000404
## genderMale       6.78702    2.47559   2.742 0.006131
## genderOther      6.66006   11.01228   0.605 0.545342
## age:genderMale  -0.42066    0.14590  -2.883 0.003949
## age:genderOther -0.76174    0.66364  -1.148 0.251085
## 
## Residual standard error: 9.366 on 6740 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.005666,   Adjusted R-squared:  0.004929 
## F-statistic: 7.681 on 5 and 6740 DF,  p-value: 3.274e-07

Plot for only teens

ggplot(data = big5_teen, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) + # jittered points only; adding geom_point() too would draw every point twice
  geom_smooth(method = "lm")

3. Version control

What is version control?

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

Bad

Good

    2013-10-14_manuscriptFish.doc
    2013-10-30_manuscriptFish.doc
    2013-11-05_manuscriptFish_initialRyanEdits.doc
    2013-11-10_manuscriptFish.doc
    2013-11-11_manuscriptFish.doc
    2013-11-15_manuscriptFish.doc
    2013-11-30_manuscriptFish.doc
    2013-12-01_manuscriptFish.doc
    2013-12-02_manuscriptFish_PNASsubmitted.doc
    2014-01-03_manuscriptFish_PLOSsubmitted.doc
    2014-02-15_manuscriptFish_PLOSrevision.doc
    2014-03-14_manuscriptFish_PLOSpublished.doc

Better - Saving everything together at once

Every time you save, you zip the entire directory that holds your project files and save the archive with a date.

Best - Version Control

How does a version control system work?

  • Version control systems start with a base version of the document and then save just the changes you made at each step of the way.

  • You can think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.



From Software Carpentry.

  • You can then think about "playing back" different sets of changes onto the base document and getting different versions of the document.



From Software Carpentry.
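
The tape analogy can be sketched in a few lines of R: keep the base document, store each change as a small function, and replay the changes in order. This only illustrates the idea; it is not how any particular version control system stores its data, and all names and contents are made up.

```r
# Base version of the document, one element per line.
base <- c("Title", "Methods")

# Each change is a function that takes a document and returns the edited one.
changes <- list(
  add_results = function(doc) c(doc, "Results"),
  rename_title = function(doc) replace(doc, 1, "Fish manuscript"),
  add_discussion = function(doc) c(doc, "Discussion")
)

# Replaying a set of changes onto the base yields a version of the document.
replay <- function(doc, steps) Reduce(function(d, f) f(d), steps, doc)

latest <- replay(base, changes)                # all changes, in order
earlier <- replay(base, changes[1])            # rewind to just after the first change
alternative <- replay(base, changes[c(1, 3)])  # a different set of changes
```

Here `latest` is the current document, `earlier` is what it looked like after the first change, and `alternative` is the version you get by leaving out the title change.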

Git for version control

  • Makes you fearless
  • Easy to set up
  • Allows you to take a snapshot of every stage of your project history
  • Takes up minimal space
  • Creates an easily navigable map of the history of all changes made

  • Integrated with RStudio

Benefits of using a hosting service like GitHub

  • Backup of your project
  • No need for a server: easy to set up
  • GitHub's strong community: your colleagues are probably already there
  • Provides tools to help enhance collaboration
  • A common location to share your work

Parting remarks

  • Everyone struggles with reproducibility and it is a hindrance to moving science forward

  • Even with a fairly simple analysis, challenges arose in four main areas: organization, documentation, automation, and dissemination

  • Over the two-day workshop, data analysis tasks will become more complex as we gather more data and ask more complicated questions, so we need better tools and workflows to combat the issues that arise in these areas

Two-pronged approach

#1 Adopt a reproducible research workflow



#2 Train new researchers who don’t have any other workflow